Multi-Agent Multi-Armed Bandit
Decentralized Randomly Distributed Multi-agent Multi-armed Bandit with Heterogeneous Rewards
We study a decentralized multi-agent multi-armed bandit problem in which multiple clients are connected by time-dependent random graphs provided by an environment. The reward distributions of each arm vary across clients, and rewards are generated independently over time by an environment based on distributions that include both sub-exponential and sub-Gaussian distributions. Each client pulls an arm and communicates with neighbors based on the graph provided by the environment. The goal is to minimize the overall regret of the entire system through collaboration. To this end, we introduce a novel algorithmic framework, which first provides robust simulation methods for generating random graphs using rapidly mixing Markov chains or the random graph model, and then combines an averaging-based consensus approach with a newly proposed weighting technique and the upper confidence bound to deliver a UCB-type solution. Our algorithms account for the randomness in the graphs, removing the conventional doubly stochastic assumption, and only require knowledge of the number of clients at initialization. We derive optimal instance-dependent regret upper bounds of order $\log{T}$ in both sub-Gaussian and sub-exponential environments, and a nearly optimal instance-free regret upper bound of order $\sqrt{T}\log T$, optimal up to a $\log T$ factor. Importantly, our regret bounds hold with high probability and capture graph randomness, whereas prior works only bound expected regret, under additional assumptions and more restrictive reward distributions.
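To make the flavor of such a scheme concrete, the following is a minimal Python sketch of a consensus-averaged UCB loop. It is not the paper's algorithm: the fresh Erdős–Rényi graph each round, the plain row-stochastic averaging, and the Gaussian rewards are stand-in assumptions, and the names (`run_consensus_ucb`, `p_edge`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_consensus_ucb(mu, T=5000, sigma=1.0, p_edge=0.3):
    """Consensus-averaged UCB sketch. mu is an (M, K) array of per-client
    arm means; the system-level target is the arm maximizing mu[:, k].mean().
    Each round, every client pulls the arm with the highest UCB index,
    updates its local estimate, then averages estimates with neighbors on
    a fresh Erdos-Renyi graph (a stand-in for the paper's graph models)."""
    M, K = mu.shape
    pulls = np.ones((M, K))                     # one initial pull per arm
    est = rng.normal(mu, sigma)                 # initial local estimates
    global_means = mu.mean(axis=0)
    regret = 0.0
    for t in range(K, T):
        bonus = np.sqrt(2.0 * np.log(t + 1) / pulls)  # exploration bonus
        arms = np.argmax(est + bonus, axis=1)         # one arm per client
        regret += (global_means.max() - global_means[arms]).sum()
        for m, k in enumerate(arms):                  # local reward updates
            r = rng.normal(mu[m, k], sigma)
            pulls[m, k] += 1
            est[m, k] += (r - est[m, k]) / pulls[m, k]
        adj = rng.random((M, M)) < p_edge             # fresh random graph
        adj = adj | adj.T | np.eye(M, dtype=bool)     # undirected + self-loops
        W = adj / adj.sum(axis=1, keepdims=True)      # row-stochastic weights
        est = W @ est                                 # one consensus step
    return regret
```

Calling `run_consensus_ucb(rng.normal(0.5, 0.2, (10, 5)))` simulates ten clients on five arms; the consensus step is what lets each client's estimate track the global arm mean rather than its own heterogeneous one.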
Statistical and Computational Trade-off in Multi-Agent Multi-Armed Bandits
We study the problem of regret minimization in Multi-Agent Multi-Armed Bandits (MAMABs) where the rewards are defined through a factor graph. We derive an instance-specific regret lower bound and characterize the minimal expected number of times each global action should be explored. Unfortunately, this bound and the corresponding optimal exploration process are obtained by solving a combinatorial optimization problem with a set of variables and constraints exponentially growing with the number of agents. We approximate the regret lower bound problem via Mean Field techniques to reduce the number of variables and constraints. By tuning the latter, we explore the trade-off between achievable regret and complexity. We devise Efficient Sampling for MAMAB (ESM), an algorithm whose regret asymptotically matches the corresponding approximated lower bound. We assess the regret and computational complexity of ESM numerically, using both synthetic and real-world experiments in radio communications networks.
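The scaling obstacle the abstract points to is easy to quantify. The toy computation below (hypothetical helper name, and ignoring the program's constraints on suboptimality gaps) contrasts the variable count of the exact lower-bound program, one variable per global action, with a fully factored mean-field approximation, one marginal per agent-action pair; the paper's tunable approximations sit between these extremes.

```python
def variable_counts(num_agents: int, actions_per_agent: int):
    """Variables in the exact lower-bound program vs. a fully factored
    mean-field approximation (illustrative extremes only)."""
    exact = actions_per_agent ** num_agents        # one per global action
    mean_field = num_agents * actions_per_agent    # one per (agent, action)
    return exact, mean_field

print(variable_counts(10, 5))   # -> (9765625, 50)
```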
Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision
Zhou, Daoyuan, Wang, Xuchuang, Yang, Lin, Gao, Yang
We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only $\mathcal{O}(\log\log T)$. Our experiments demonstrate significant performance improvements over existing baselines. Compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving the lower bound for this problem and presenting an algorithm that achieves logarithmic regret.
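For readers unfamiliar with the collision model, here is a minimal sketch of one feedback round; the helper name and the Bernoulli rewards are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def mmab_round(mu, choices):
    """One round of the collision model: each player draws a Bernoulli
    reward from its chosen arm unless another player chose the same arm,
    in which case the reward is zero and a collision flag is observed."""
    mu, choices = np.asarray(mu), np.asarray(choices)
    counts = np.bincount(choices, minlength=len(mu))
    collided = counts[choices] > 1                 # per-player collision flag
    rewards = np.where(collided, 0.0, rng.binomial(1, mu[choices]))
    return rewards, collided
```

For example, `mmab_round([0.9, 0.5, 0.2], [0, 0, 1])` gives the two players on arm 0 zero reward and a `True` collision flag, the signal that collision-based protocols commonly exploit for implicit communication.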
Learning to Coordinate Under Threshold Rewards: A Cooperative Multi-Agent Bandit Framework
Ledford, Michael, Regli, William
Cooperative multi-agent systems often face tasks that require coordinated actions under uncertainty. While multi-armed bandit (MAB) problems provide a powerful framework for decentralized learning, most prior work assumes individually attainable rewards. We address the challenging setting where rewards are threshold-activated: an arm yields a payoff only when a minimum number of agents pull it simultaneously, with this threshold unknown in advance. Complicating matters further, some arms are decoys: they require coordination to activate but yield no reward, introducing a new challenge of wasted joint exploration. We introduce Threshold-Coop-UCB (T-Coop-UCB), a decentralized algorithm that enables agents to jointly learn activation thresholds and reward distributions, forming effective coalitions without centralized control. Empirical results show that T-Coop-UCB consistently outperforms baseline methods in cumulative reward, regret, and coordination metrics, achieving near-Oracle performance. Our findings underscore the importance of joint threshold learning and decoy avoidance for scalable, decentralized cooperation in complex multi-agent systems.
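A minimal sketch of the threshold-activated reward model may help fix ideas; `thresholds`, `decoy`, and the Gaussian payoffs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def threshold_round(mu, thresholds, decoy, choices):
    """Threshold-activated rewards: arm k pays out only if at least
    thresholds[k] agents pull it simultaneously, and decoy arms pay
    nothing even when activated."""
    mu, choices = np.asarray(mu), np.asarray(choices)
    counts = np.bincount(choices, minlength=len(mu))
    active = counts >= np.asarray(thresholds)            # activation per arm
    payoffs = np.where(active & ~np.asarray(decoy), rng.normal(mu, 0.1), 0.0)
    return payoffs[choices]                              # payoff each agent sees
```

With `mu=[1.0, 1.0]`, `thresholds=[2, 2]`, and `decoy=[False, True]`, three agents all pulling arm 1 do activate it yet earn nothing: exactly the wasted joint exploration that decoy avoidance targets.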
Multi-agent Multi-armed Bandit with Fully Heavy-tailed Dynamics
We study decentralized multi-agent multi-armed bandits in fully heavy-tailed settings, where clients communicate over sparse random graphs with heavy-tailed degree distributions and observe heavy-tailed (homogeneous or heterogeneous) reward distributions with potentially infinite variance. The objective is to maximize system performance by pulling the globally optimal arm with the highest global reward mean across all clients. We are the first to address such fully heavy-tailed scenarios, which capture the dynamics and challenges in communication and inference among multiple clients in real-world systems. In homogeneous settings, our algorithmic framework exploits hub-like structures unique to heavy-tailed graphs, allowing clients to aggregate rewards and reduce noise via hub estimators when constructing UCB indices; under $M$ clients and degree distributions with power-law index $\alpha > 1$, our algorithm attains a regret bound of (almost) order $O(M^{1 -\frac{1}{\alpha}} \log{T})$. Under heterogeneous rewards, clients synchronize by communicating with neighbors and aggregating exchanged estimators in UCB indices; with our newly established information delay bounds on sparse random graphs, we prove a regret bound of $O(M \log{T})$. Our results improve upon existing work, which addresses only time-invariant connected graphs, or light-tailed dynamics with dense graphs and light-tailed rewards.
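The abstract does not spell out the hub estimators, but a standard robust-mean building block for rewards with heavy tails (and possibly infinite variance) is median-of-means, sketched below as a plausible stand-in for what a hub could compute over the samples its neighbors forward.

```python
import numpy as np

def median_of_means(samples, num_blocks=8):
    """Median-of-means: split the samples into blocks, average each block,
    and return the median of the block means. It concentrates much better
    than the empirical mean on heavy-tailed data, which is why estimators
    of this family are common in heavy-tailed bandits. Assumes
    len(samples) >= num_blocks."""
    blocks = np.array_split(np.asarray(samples, dtype=float), num_blocks)
    return float(np.median([b.mean() for b in blocks]))
```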
Relational Weight Optimization for Enhancing Team Performance in Multi-Agent Multi-Armed Bandits
Kotturu, Monish Reddy, Movahed, Saniya Vahedian, Robinette, Paul, Jerath, Kshitij, Redlich, Amanda, Azadeh, Reza
Multi-Armed Bandits (MABs) are a class of reinforcement learning problems where an agent is presented with a set of arms (i.e., actions), with each arm giving a reward drawn from a probability distribution unknown to the agent (Lattimore and Szepesvári, 2020). The goal of the agent is to maximize its total reward, which requires balancing exploration and exploitation. MABs offer a simple model to simulate decision-making under uncertainty. Practical applications of MAB algorithms include news recommendations (Yang and Toni, 2018), online ad placement (Aramayo et al., 2022), dynamic pricing (Babaioff et al., 2015), and adaptive experimental design (Rafferty et al., 2019). In contrast to single-agent cases, in certain applications such as search and rescue, a team of agents should cooperate with each other to accomplish goals by maximizing team performance. Such problems are solved using Multi-Agent Multi-Armed Bandit (MAMAB) algorithms (Xu et al., 2020). Most existing algorithms rely on the presence of … Using a graph to represent the team behavior ensures that the relationships between the agents are maintained. However, existing works either do not consider the weight of each relationship (graph edges) (Madhushani and Leonard, 2020; Agarwal et al., 2021) or expect the user to manually set those weights (Moradipari et al., 2022). In this paper, we propose a new approach that combines graph optimization and MAMAB algorithms to enhance team performance by expediting the convergence to consensus of arm means. Our proposed approach:
- improves team performance by optimizing the edge weights in the graph representing the team structure in large constrained teams,
- does not require manual tuning of the graph weights,
- is independent of the MAMAB algorithm and only depends on the consensus formula, and
- formulates the problem as a convex optimization, which is computationally efficient for large teams (a sketch follows below).
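One plausible reading of that convex program is fastest distributed linear averaging in the style of Boyd et al.: minimize the spectral norm of $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ over symmetric weight matrices supported on the team graph, i.e., the worst-case per-step contraction toward consensus. The cvxpy sketch below is a stand-in under that assumption, not necessarily the authors' exact formulation.

```python
import cvxpy as cp
import numpy as np

def fastest_averaging_weights(adj):
    """Optimize consensus edge weights on a fixed undirected graph by
    minimizing ||W - (1/n) 11^T||_2 subject to W being symmetric with
    unit row sums and zero weight on non-edges. adj is a boolean
    adjacency matrix; diagonal (self) weights are always allowed."""
    n = adj.shape[0]
    W = cp.Variable((n, n), symmetric=True)
    J = np.ones((n, n)) / n
    constraints = [cp.sum(W, axis=1) == 1]          # rows sum to one
    constraints += [W[i, j] == 0                    # respect graph sparsity
                    for i in range(n) for j in range(n)
                    if i != j and not adj[i, j]]
    cp.Problem(cp.Minimize(cp.norm(W - J, 2)), constraints).solve()
    return W.value
```

Any consensus update of the form `est = W @ est` can then use the returned weights in place of uniform neighbor averaging, which is how such an optimizer stays independent of the particular MAMAB algorithm.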
Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities
Xie, Hong, Mo, Jinyu, Lian, Defu, Wang, Jie, Chen, Enhong
Motivated by distributed selection problems, we formulate a new variant of the multi-player multi-armed bandit (MAB) model, which captures stochastic arrival of requests to each arm as well as the policy of allocating requests to players. The challenge is to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile (an arm pulling profile prescribes the number of players at each arm) without communicating with each other. We first design a greedy algorithm, which locates one of the optimal arm pulling profiles with polynomial computational complexity. We also design an iterative distributed algorithm for players to commit to an optimal arm pulling profile within a constant number of rounds in expectation. We apply the explore-then-commit (ETC) framework to address the online setting where model parameters are unknown. We design an exploration strategy for players to estimate the optimal arm pulling profile. Since such estimates can differ across players, it is challenging for players to commit. We then design an iterative distributed algorithm which guarantees that players arrive at a consensus on the optimal arm pulling profile in only M rounds. We conduct experiments to validate our algorithm.
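As a sketch of how a greedy profile search can work, suppose the marginal value of adding one more player to an arm is known and diminishing; for such a separable concave surrogate, repeatedly assigning the next player to the largest remaining marginal gain is optimal. The `marginal` array and the diminishing-returns assumption are illustrative, not the paper's exact model.

```python
import heapq
import numpy as np

def greedy_profile(marginal, num_players):
    """Greedy arm-pulling-profile sketch. marginal[k, c] is the expected
    extra reward of placing a (c+1)-th player on arm k, assumed
    nonincreasing in c. Each step assigns the next player to the arm with
    the largest remaining marginal gain; marginal must have enough columns
    to host all players."""
    K = marginal.shape[0]
    profile = np.zeros(K, dtype=int)
    heap = [(-marginal[k, 0], k) for k in range(K)]   # max-heap via negation
    heapq.heapify(heap)
    for _ in range(num_players):
        _, k = heapq.heappop(heap)
        profile[k] += 1
        if profile[k] < marginal.shape[1]:
            heapq.heappush(heap, (-marginal[k, profile[k]], k))
    return profile
```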